The purpose of this task is to come up with various forecasting techniques for our term project. Initially, seasonality will be analyzed for each product according to seasonality series will be decomposed. Unfortunately, due to lack of data only daily decomposition can be applied. The reason behind such problem is Trendyol does not share data before May 2020, which causes weekly and monthly decomposition not to be made. Based on decomposition for each product, proper ARIMA models will be build. Then, possible regressor analysis will made for each product according to relations with sold count. ARIMAX models will build with specified regressors to reach better forecast values.
Required libraries and data manipulations
library(caTools)
library(xts)
library(zoo)
library(forecast)
library(urca)
library(ggplot2)
data[, is.discount_days:= 0]
rawdata <- read.csv("/Users/tolgaerdogan/Documents/GitHub/spring21-TolgaErdogann/Data/ProjectRawData.csv", header=TRUE, sep = ";")
rawdata$event_date <- as.Date(rawdata$event_date , format = "%d.%m.%Y")
alldata<-rbind(rawdata,data)
alldata<-as.data.table(alldata)
alldata$is.discount_days[alldata$event_date=="2021-03-08"] <- 1
alldata$is.discount_days[alldata$event_date=="2021-03-09"] <- 1
alldata$is.discount_days[alldata$event_date=="2021-03-10"] <- 1
alldata$is.discount_days[alldata$event_date=="2021-03-11"] <- 1
alldata$is.discount_days[alldata$event_date=="2021-03-12"] <- 1
# Some of the other discount days are set to 1 in ProjectRawData.csv
n<-400
dates<- seq(as.Date("2020-05-25"), length = n, by = "days")
#validation_dates<-seq(as.Date("2021-03-01"), length = n-280, by = "days")
earphone<-subset(alldata, alldata$product_content_id==6676673)
earphone_daily_tr<-earphone[earphone$event_date<"2021-06-21"]
earphone_daily_te<-earphone[earphone$event_date>="2021-06-21"]
plot(earphone_daily_tr$sold_count, type ='l')
Plot above shows sold counts of Xiaomi Bluetooth Earphone, which has no visible trend over time and variance seems constant, although there are outlier points.
earphone_ts_daily<-ts(earphone_daily_tr$sold_count, freq=7)
tsdisplay(earphone_ts_daily)
Necessary data manipulations are made in order to obtain daily data. ACF plot shows decreasing significant correlation until lag 8 and PACF shows only significant correlation at lag 1. It will be examined further after decomposition.
earphone_daily_additive<-decompose(earphone_ts_daily, type ="additive")
earphone_daily_multiplicative<-decompose(earphone_ts_daily, type ="multiplicative")
plot(earphone_daily_additive)
plot(earphone_daily_multiplicative)
For additive decomposition random part has less variance compared to multiplicative approach. Also, multiplicative decomposition random part’s variance changes over time. For both model has zero mean for random part.
In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test11=ur.kpss(earphone_daily_additive$random, use.lag="7")
summary(test11)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0112
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test12=ur.kpss(earphone_daily_multiplicative$random, use.lag="7")
summary(test12)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0336
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Additive decomposition gives more stationary random part compared to multiplicative according value of test statisctic and critical values. Hence, additive is chosen for further examinations.
plot(earphone_daily_additive$seasonal[1:30], type='l')
plot(earphone_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition .
Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
earphone_xts<-xts(earphone_daily_tr$sold_count, order.by=as.Date(earphone_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(earphone_xts,on="weeks")
earphone_xts_weekly<-period.apply(earphone_xts,INDEX=weekly,FUN=mean)
plot(earphone_xts_weekly)
monthly<-endpoints(earphone_xts,on="months")
earphone_xts_monthly<-period.apply(earphone_xts,INDEX=monthly,FUN=mean)
plot(earphone_xts_monthly)
Weekly and monthly plots does not show any significant pattern or trend. If we had more data points, maybe a more meaningful result could be reached from these graphs.
tsdisplay(earphone_daily_additive$random, na.action = na.omit)
According to ACF plot, there is significant negative correlation at lag 3. Hence, moving average parameter should be 3. Also, it can be seen from PACF plot that correlation significantly decreased after lag 3 which means that AR parameter should be 3.
earphone_model<-arima(earphone_daily_additive$random, order = c(3,0,3))
AIC(earphone_model)
## [1] 4641.866
earphone_model2<-arima(earphone_daily_additive$random, order = c(3,0,2))
AIC(earphone_model2)
## [1] 4639.877
earphone_model3<-arima(earphone_daily_additive$random, order = c(2,0,2))
AIC(earphone_model3)
## [1] 4643.746
When adjacent parameters also checked, model 2 which has (3,0,2) parameters is better in terms of AIC values, therefore it is chosen.
Fitting the model
model_forecast_error <- predict(earphone_model2, n.ahead = 11)$pred
last_trend_value <-tail(earphone_daily_additive$trend[!is.na(earphone_daily_additive$trend)],1)
seasonality=earphone_daily_additive$seasonal[295:305]
earphone_arima_forecast=model_forecast_error+seasonality+last_trend_value
earphone_daily_te$forecast<-paste(earphone_arima_forecast)
earphone_daily_te$forecast<-as.numeric(earphone_daily_te$forecast)
ggplot(earphone_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Plot above shows for 10 days actual and predicted values with ARIMA model that we built on decomposed random part. Model that we built with ARIMA (3,0,2) parameters does not explain why earphone is not sold lately.
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 -0.7109682 0.8337796 251.4242 0.7109682 -0.8337796
WMAPE is 68% which is too much. ARIMA model clearly overpredicts the sales volume. This is mainly due to the constant trend assumption made.
earphone_lm1<-lm(sold_count~visit_count+is.discount_days, data=earphone_daily_tr)
summary(earphone_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = earphone_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.74 -148.02 -34.11 115.73 1524.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.580e+02 1.267e+01 28.252 < 2e-16 ***
## visit_count 4.675e-03 9.428e-04 4.958 1.06e-06 ***
## is.discount_days 1.727e+02 5.445e+01 3.172 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206.8 on 389 degrees of freedom
## Multiple R-squared: 0.0815, Adjusted R-squared: 0.07678
## F-statistic: 17.26 on 2 and 389 DF, p-value: 6.585e-08
earphone_lm2<-lm(sold_count~category_sold+price, data=earphone_daily_tr)
summary(earphone_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = earphone_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -375.08 -108.80 -24.71 84.28 1312.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.398e+03 1.146e+02 12.201 <2e-16 ***
## category_sold 1.943e-02 9.117e-03 2.132 0.0337 *
## price -7.522e+00 8.016e-01 -9.385 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 183 on 389 degrees of freedom
## Multiple R-squared: 0.281, Adjusted R-squared: 0.2773
## F-statistic: 76 on 2 and 389 DF, p-value: < 2.2e-16
earphone_lm3<-lm(sold_count~price+favored_count+category_sold, data=earphone_daily_tr)
summary(earphone_lm3)
##
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold,
## data = earphone_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.26 -105.48 -26.86 70.39 1335.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.347e+03 1.175e+02 11.465 <2e-16 ***
## price -7.230e+00 8.140e-01 -8.882 <2e-16 ***
## favored_count 2.029e-02 1.082e-02 1.874 0.0616 .
## category_sold 1.713e-02 9.171e-03 1.868 0.0625 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 182.4 on 388 degrees of freedom
## Multiple R-squared: 0.2874, Adjusted R-squared: 0.2819
## F-statistic: 52.16 on 3 and 388 DF, p-value: < 2.2e-16
earphone_lm4<-lm(sold_count~favored_count+basket_count+visit_count, data=earphone_daily_tr)
summary(earphone_lm4)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count,
## data = earphone_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -692.89 -41.70 2.91 57.37 348.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.7185477 12.4203458 1.266 0.20643
## favored_count -0.0229077 0.0083911 -2.730 0.00662 **
## basket_count 0.2587852 0.0078698 32.883 < 2e-16 ***
## visit_count -0.0002814 0.0006716 -0.419 0.67549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.2 on 388 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.7521
## F-statistic: 396.5 on 3 and 388 DF, p-value: < 2.2e-16
earphone_daily_te$forecastlm1<-predict(earphone_lm1,earphone_daily_te)
earphone_daily_te$forecastlm2<-predict(earphone_lm2,earphone_daily_te)
earphone_daily_te$forecastlm3<-predict(earphone_lm3,earphone_daily_te)
earphone_daily_te$forecastlm4<-predict(earphone_lm4,earphone_daily_te)
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 -0.2127934 0.3463527 103.3834 0.2923437 -0.2973753
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 -0.644767 0.743205 228.0131 0.644767 -0.743205
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 -0.6301594 0.7262419 222.8473 0.6301594 -0.7262419
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastlm4)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 0.1521102 0.1459123 55.23074 0.1561795 0.1390922
4 different multiple linear regression model is built to understand which regressors are explaining model better. In terms of adjusted R square and WMAPE 4th model is explaining sales behavior better than others. Therefore, 4th model’s regressors will be chosen which are favored count, basket count and visit count.
earphone_xreg1<-cbind(earphone_daily_tr$favored_count,
earphone_daily_tr$basket_count,
earphone_daily_tr$visit_count)
earphone_xreg2<-cbind(earphone_daily_te$favored_count,
earphone_daily_te$basket_count,
earphone_daily_te$visit_count)
earphone_arimax<-Arima(earphone_ts_daily,xreg=as.matrix(earphone_xreg1),order=c(3,0,2))
earphone_daily_te$forecastarimax<-forecast(earphone_arimax,xreg=as.matrix(earphone_xreg2))$mean
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 353.6364 106.4878 -0.1613391 0.2019397 57.68401 0.1631167 -0.2008867
WMAPE is 0.16 which is little higher than multiple linear regression model with same regressors. Plot below shows forecasted values with ARIMAX vs actual sales.
ggplot(earphone_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
There is excessive difference forecasted values between ARIMA and ARIMAX models. As can be seen from plot above, regressors improved model incontestably.
vacuum<-subset(alldata, alldata$product_content_id==7061886)
vacuum_daily_tr<-vacuum[vacuum$event_date<"2021-06-21"]
vacuum_daily_te<-vacuum[vacuum$event_date>="2021-06-21"]
plot(vacuum_daily_tr$sold_count, type ='l')
Fakir vacuum cleaner daily sold count plot for given days can be seen above. Sales of item has significant visible outlier points.
vacuum_ts_daily<-ts(vacuum_daily_tr$sold_count, freq=7)
tsdisplay(vacuum_ts_daily)
Necessary data manipulations are made in order to obtain daily data. ACF plot seems like there is seasonality, also only significant PACF at lag 1.
vacuum_daily_additive<-decompose(vacuum_ts_daily, type ="additive")
vacuum_daily_multiplicative<-decompose(vacuum_ts_daily, type ="multiplicative")
plot(vacuum_daily_additive)
plot(vacuum_daily_multiplicative)
Additive decomposition random part yield better result in terms of variance.
In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test13=ur.kpss(vacuum_daily_additive$random, use.lag="7")
summary(test13)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0111
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test14=ur.kpss(vacuum_daily_multiplicative$random, use.lag="7")
summary(test14)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1153
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
As expected from random part graphs, additive decomposition gave more stationary random part.
plot(vacuum_daily_additive$seasonal[1:30], type='l')
plot(vacuum_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition .
Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
vacuum_xts<-xts(vacuum_daily_tr$sold_count, order.by=as.Date(vacuum_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(vacuum_xts,on="weeks")
vacuum_xts_weekly<-period.apply(vacuum_xts,INDEX=weekly,FUN=mean)
plot(vacuum_xts_weekly)
The peaks in November coincide with the trendyol discount days and black friday days that we determined in our project.
monthly<-endpoints(vacuum_xts,on="months")
vacuum_xts_monthly<-period.apply(vacuum_xts,INDEX=monthly,FUN=mean)
plot(vacuum_xts_monthly)
From monthly sales data, november has highest sales which is expected due to price discounts and advertisements.
tsdisplay(vacuum_daily_additive$random, na.action = na.omit)
According to ACF plot, there is significant negative correlation at lag 3. Hence, moving average parameter should be 3. Also, it can be seen from PACF plot that correlation significantly decreased after lag 1 which means that AR parameter should be 1.
vacuum_model<-arima(vacuum_daily_additive$random, order = c(1,0,3))
AIC(vacuum_model)
## [1] 3268.003
vacuum_model2<-arima(vacuum_daily_additive$random, order = c(1,0,2))
AIC(vacuum_model2)
## [1] 3375.757
vacuum_model3<-arima(vacuum_daily_additive$random, order = c(2,0,2))
AIC(vacuum_model3)
## [1] 3354.103
When adjacent parameters also checked, still model 1 which has (1,0,3) parameters is better in terms of AIC values, therefore it is chosen.
Fitting the model
model_forecast_error <- predict(vacuum_model, n.ahead = 11)$pred
last_trend_value <-tail(vacuum_daily_additive$trend[!is.na(vacuum_daily_additive$trend)],1)
seasonality=vacuum_daily_additive$seasonal[295:305]
vacuum_arima_forecast=model_forecast_error+seasonality+last_trend_value
vacuum_daily_te$forecast<-paste(vacuum_arima_forecast)
vacuum_daily_te$forecast<-as.numeric(vacuum_daily_te$forecast)
ggplot(vacuum_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Plot above shows for 10 days actual and predicted values with ARIMA model that we built on decomposed random part. It seems like although ARIMA model catches ups and downs on sales, it is not explaining clearly.
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 -0.2023177 0.904415 6.506546 0.5224234 -0.6808567
WMAPE is 56% which is high. Whereas model catched the seasonality constant trend assumption increased the errors.
vacuum_lm1<-lm(sold_count~visit_count+is.discount_days, data=vacuum_daily_tr)
summary(vacuum_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = vacuum_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.701 -15.605 -5.764 6.033 158.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.967454 1.918260 19.271 < 2e-16 ***
## visit_count -0.003763 0.001448 -2.599 0.00969 **
## is.discount_days 113.452407 8.294838 13.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.5 on 389 degrees of freedom
## Multiple R-squared: 0.333, Adjusted R-squared: 0.3295
## F-statistic: 97.09 on 2 and 389 DF, p-value: < 2.2e-16
vacuum_lm2<-lm(sold_count~category_sold+price, data=vacuum_daily_tr)
summary(vacuum_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = vacuum_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.151 -15.039 2.828 10.488 186.436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207.969490 16.240448 12.81 <2e-16 ***
## category_sold 0.103703 0.007536 13.76 <2e-16 ***
## price -0.725780 0.062501 -11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.69 on 389 degrees of freedom
## Multiple R-squared: 0.4075, Adjusted R-squared: 0.4044
## F-statistic: 133.7 on 2 and 389 DF, p-value: < 2.2e-16
vacuum_lm3<-lm(sold_count~price+favored_count+category_sold, data=vacuum_daily_tr)
summary(vacuum_lm3)
##
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold,
## data = vacuum_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.934 -15.629 3.049 11.286 186.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257.599267 19.818724 12.998 < 2e-16 ***
## price -0.941892 0.079986 -11.776 < 2e-16 ***
## favored_count 0.064816 0.015443 4.197 3.36e-05 ***
## category_sold 0.106979 0.007421 14.415 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.08 on 388 degrees of freedom
## Multiple R-squared: 0.4332, Adjusted R-squared: 0.4288
## F-statistic: 98.84 on 3 and 388 DF, p-value: < 2.2e-16
vacuum_lm4<-lm(sold_count~favored_count+basket_count+visit_count, data=vacuum_daily_tr)
summary(vacuum_lm4)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count,
## data = vacuum_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.248 -4.510 -0.059 4.581 141.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.298130 1.736934 -0.172 0.863809
## favored_count -0.038383 0.010279 -3.734 0.000217 ***
## basket_count 0.294578 0.008380 35.154 < 2e-16 ***
## visit_count 0.001147 0.001166 0.984 0.325961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.69 on 388 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7641
## F-statistic: 423.2 on 3 and 388 DF, p-value: < 2.2e-16
vacuum_daily_te$forecastlm1<-predict(vacuum_lm1,vacuum_daily_te)
vacuum_daily_te$forecastlm2<-predict(vacuum_lm2,vacuum_daily_te)
vacuum_daily_te$forecastlm3<-predict(vacuum_lm3,vacuum_daily_te)
vacuum_daily_te$forecastlm4<-predict(vacuum_lm4,vacuum_daily_te)
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 -1.717622 3.400904 21.3922 1.717622 -3.400904
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 -4.406717 5.271331 54.88365 4.406717 -5.271331
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 -3.684128 3.698435 45.88414 3.684128 -3.698435
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm4)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 0.1534633 0.2839136 2.947285 0.2366433 0.01318687
4 different multiple linear regression model is built to understand which regressors are explaining model better. In terms of adjusted R square and WMAPE 4th model is explaining sales behavior better than others. Therefore, 4th model’s regressors will be chosen which are favored count, basket count and visit count.
vacuum_xreg1<-cbind(vacuum_daily_tr$favored_count,
vacuum_daily_tr$basket_count,
vacuum_daily_tr$visit_count)
vacuum_xreg2<-cbind(vacuum_daily_te$favored_count,
vacuum_daily_te$basket_count,
vacuum_daily_te$visit_count)
vacuum_arimax<-Arima(vacuum_ts_daily,xreg=as.matrix(vacuum_xreg1),order=c(1,0,3))
vacuum_daily_te$forecastarimax<-forecast(vacuum_arimax,xreg=as.matrix(vacuum_xreg2))$mean
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 12.45455 6.990903 0.007015643 0.3925069 2.761589 0.2217334 -0.247751
WMAPE is 0.23 which is better than multiple linear regression model with same regressors. Plot below shows forecasted values with ARIMAX vs actual sales.
ggplot(vacuum_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
As can be seen from plot, actual and forecasted values are close to each other and WMAPE is low as desired. Vacuum cleaner sales count explained clearly with favored count, basket count and visit count.
facial_cleanser<-subset(alldata, alldata$product_content_id==85004)
facial_cleanser_daily_tr<-facial_cleanser[facial_cleanser$event_date<"2021-06-21"]
facial_cleanser_daily_te<-facial_cleanser[facial_cleanser$event_date>="2021-06-21"]
plot(facial_cleanser_daily_tr$sold_count, type ='l')
Plot above shows daily sales of La Roche Posay face cleanser. Sales of such item increased over time and variance increased at fall and winter term.
facial_cleanser_ts_daily<-ts(facial_cleanser_daily_tr$sold_count, freq=7)
tsdisplay(facial_cleanser_ts_daily)
In order to see the daily effect in the data, a time series was created by giving frequency 7. ACF plot seems like there is seasonality, also only significant PACF at lag 1.
facial_cleanser_daily_additive<-decompose(facial_cleanser_ts_daily, type ="additive")
facial_cleanser_daily_multiplicative<-decompose(facial_cleanser_ts_daily, type ="multiplicative")
plot(facial_cleanser_daily_additive)
plot(facial_cleanser_daily_multiplicative)
Additive decomposition random part has less variance but it is not constant over time. Multiplicative decomposition random part has more variance but it is constant over time. It’s hard to make a definitive statement just by looking at the charts. Therefore, necessary tests will be applied to check stationarity.
test15=ur.kpss(facial_cleanser_daily_additive$random, use.lag="7")
summary(test15)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0092
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test16=ur.kpss(facial_cleanser_daily_multiplicative$random, use.lag="7")
summary(test16)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1831
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Additive decomposition method should be chosen since it gives more stationary random part according to test results.
plot(facial_cleanser_daily_additive$seasonal[1:30], type='l')
plot(facial_cleanser_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition .
Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
facial_cleanser_xts<-xts(facial_cleanser_daily_tr$sold_count, order.by=as.Date(facial_cleanser_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(facial_cleanser_xts,on="weeks")
facial_cleanser_xts_weekly<-period.apply(facial_cleanser_xts,INDEX=weekly,FUN=mean)
plot(facial_cleanser_xts_weekly)
monthly<-endpoints(facial_cleanser_xts,on="months")
facial_cleanser_xts_monthly<-period.apply(facial_cleanser_xts,INDEX=monthly,FUN=mean)
plot(facial_cleanser_xts_monthly)
Sales in November are affected from discount days which are determined in our project to yield better forecast result.
tsdisplay(facial_cleanser_daily_additive$random, na.action = na.omit)
According to ACF plot, there is significant negative correlation at lag 3. Hence, moving average parameter should be 3. Also, it can be seen from PACF plot that correlation significantly decreased after lag 1 which means that AR parameter should be 1.
facial_cleanser_model<-arima(facial_cleanser_daily_additive$random, order = c(1,0,3))
AIC(facial_cleanser_model)
## [1] 3644.424
facial_cleanser_model2<-arima(facial_cleanser_daily_additive$random, order = c(1,0,2))
AIC(facial_cleanser_model2)
## [1] 3753.453
facial_cleanser_model3<-arima(facial_cleanser_daily_additive$random, order = c(2,0,2))
AIC(facial_cleanser_model3)
## [1] 3729.383
When adjacent parameters also checked, still model 1 which has (1,0,3) parameters is better in terms of AIC values, therefore it is chosen.
Fitting the model
model_forecast_error <- predict(facial_cleanser_model, n.ahead = 11)$pred
last_trend_value <-tail(facial_cleanser_daily_additive$trend[!is.na(facial_cleanser_daily_additive$trend)],1)
seasonality=facial_cleanser_daily_additive$seasonal[295:305]
facial_cleanser_arima_forecast=model_forecast_error+seasonality+last_trend_value
facial_cleanser_daily_te$forecast<-paste(facial_cleanser_arima_forecast)
facial_cleanser_daily_te$forecast<-as.numeric(facial_cleanser_daily_te$forecast)
ggplot(facial_cleanser_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Plot above shows for 10 days actual and predicted values with ARIMA model that we built on decomposed random part. While actual sales seesaw sharp, predicted sales has smooth changes day by day. WMAPE should also be checked.
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 -0.04346981 0.2029299 13.45183 0.188018 -0.1021247
WMAPE value is 17% which is not bad. To obtain a better model, appropriate regressors should be investigated.
facial_cleanser_lm1<-lm(sold_count~visit_count+is.discount_days, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = facial_cleanser_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165.19 -24.30 -12.92 8.25 342.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.492e+01 3.087e+00 17.790 < 2e-16 ***
## visit_count 8.726e-03 1.153e-03 7.571 2.71e-13 ***
## is.discount_days 1.645e+02 1.303e+01 12.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.47 on 389 degrees of freedom
## Multiple R-squared: 0.3605, Adjusted R-squared: 0.3572
## F-statistic: 109.6 on 2 and 389 DF, p-value: < 2.2e-16
facial_cleanser_lm2<-lm(sold_count~category_sold+price, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = facial_cleanser_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.96 -36.21 -15.54 20.80 297.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 358.941849 42.947829 8.358 1.15e-15 ***
## category_sold 0.024796 0.002796 8.869 < 2e-16 ***
## price -3.855490 0.555368 -6.942 1.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.27 on 389 degrees of freedom
## Multiple R-squared: 0.2017, Adjusted R-squared: 0.1976
## F-statistic: 49.14 on 2 and 389 DF, p-value: < 2.2e-16
facial_cleanser_lm3<-lm(sold_count~price+favored_count+category_sold, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm3)
##
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold,
## data = facial_cleanser_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.66 -24.63 -6.67 15.65 328.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 477.719639 37.749407 12.655 < 2e-16 ***
## price -5.649418 0.493944 -11.437 < 2e-16 ***
## favored_count 0.060934 0.004961 12.283 < 2e-16 ***
## category_sold 0.018515 0.002430 7.619 1.97e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.96 on 388 degrees of freedom
## Multiple R-squared: 0.4252, Adjusted R-squared: 0.4207
## F-statistic: 95.67 on 3 and 388 DF, p-value: < 2.2e-16
facial_cleanser_lm4<-lm(sold_count~favored_count+basket_count+visit_count+is.discount_days+category_sold, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm4)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count +
## is.discount_days + category_sold, data = facial_cleanser_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.291 -9.912 0.349 9.431 137.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.023e+01 2.270e+00 -4.507 8.74e-06 ***
## favored_count -4.531e-03 3.528e-03 -1.284 0.1999
## basket_count 2.934e-01 8.980e-03 32.673 < 2e-16 ***
## visit_count -1.503e-02 8.611e-04 -17.458 < 2e-16 ***
## is.discount_days 5.102e+01 7.080e+00 7.206 3.06e-12 ***
## category_sold 3.477e-03 1.211e-03 2.872 0.0043 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.91 on 386 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8621
## F-statistic: 489.9 on 5 and 386 DF, p-value: < 2.2e-16
facial_cleanser_daily_te$forecastlm1<-predict(facial_cleanser_lm1,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm2<-predict(facial_cleanser_lm2,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm3<-predict(facial_cleanser_lm3,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm4<-predict(facial_cleanser_lm4,facial_cleanser_daily_te)
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 -0.09109016 0.2302715 14.5634 0.2035545 -0.1604838
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 -0.6463231 0.6857502 46.24148 0.6463231 -0.6857502
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 -0.1772046 0.227159 16.42899 0.2296301 -0.1677145
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm4)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 0.03935856 0.2557536 16.8197 0.2350911 0.05802043
4 different multiple linear regression model is built to understand which regressors are explaining the model better. In terms of adjusted R square, 4th model is explaining sales behavior better than others. When WMAPE is checked, 1st model has minimum value among all models. Nevertheless, 1st and 4th model has close WMAPE values. Therefore, 4th model’s regressors will be chosen which are favored count, basket count, visit count, category sold and discount days.
facial_cleanser_xreg1<-cbind(facial_cleanser_daily_tr$favored_count,
facial_cleanser_daily_tr$basket_count,
facial_cleanser_daily_tr$category_sold,
facial_cleanser_daily_tr$is.discount_days,
facial_cleanser_daily_tr$visit_count)
facial_cleanser_xreg2<-cbind(facial_cleanser_daily_te$favored_count,
facial_cleanser_daily_te$basket_count,
facial_cleanser_daily_te$category_sold,
facial_cleanser_daily_te$is.discount_days,
facial_cleanser_daily_te$visit_count)
facial_cleanser_arimax<-Arima(facial_cleanser_ts_daily,xreg=as.matrix(facial_cleanser_xreg1),order=c(1,0,3))
facial_cleanser_daily_te$forecastarimax<-forecast(facial_cleanser_arimax,xreg=as.matrix(facial_cleanser_xreg2))$mean
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 71.54545 21.59335 -0.07684045 0.2303227 15.23747 0.2129761 -0.06672833
WMAPE is 0.18 which is better than multiple linear regression model with same regressors. Plot below shows forecasted values with ARIMAX vs actual sales.
ggplot(facial_cleanser_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
As can be seen from plot, actual and forecasted values are close to each other and WMAPE is low as desired. Facial cleanser sales count explained clearly with favored count, basket count, category sold, visit count and discount days.
babywipe<-subset(alldata, alldata$product_content_id==4066298)
babywipe_daily_tr<-babywipe[babywipe$event_date<"2021-06-21"]
babywipe_daily_te<-babywipe[babywipe$event_date>="2021-06-21"]
plot(babywipe_daily_tr$sold_count, type ='l')
The chart above shows a visualized version of the data. Looking at the graph, it can be said that there are outlier points and seasonality is visible.
babywipe_ts_daily<-ts(babywipe_daily_tr$sold_count, freq=7)
tsdisplay(babywipe_ts_daily)
In order to obtain a time series, necessary manipulations were made on the data. ACF plot shows significant correlation at lag 1,2 and 3, which shows that data has trend.
babywipe_daily_additive<-decompose(babywipe_ts_daily, type ="additive")
babywipe_daily_multiplicative<-decompose(babywipe_ts_daily, type ="multiplicative")
plot(babywipe_daily_additive)
plot(babywipe_daily_multiplicative)
The trend, seasonal and random graphs, which are formed when decomposed as additive and multiplicative, are shown above. Zero mean assumption looks like satisfied for both additive and multiplicative random parts. Neverthless, additive random part has visibly less variance compared to multiplicative random part.
In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test17=ur.kpss(babywipe_daily_additive$random, use.lag="7")
summary(test17)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0589
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test18=ur.kpss(babywipe_daily_multiplicative$random, use.lag="7")
summary(test18)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0859
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The test-statistic for both decomposition method is smaller than the crucial value, indicating that we are unable to reject the null hypothesis. As a result, we can conclude the random data is stationary. However, additive decomposition has smaller test statistic. So, it should be used for further analysis.
plot(babywipe_daily_additive$seasonal[1:30], type='l')
plot(babywipe_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition .
Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
babywipe_xts<-xts(babywipe_daily_tr$sold_count, order.by=as.Date(babywipe_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(babywipe_xts,on="weeks")
babywipe_xts_weekly<-period.apply(babywipe_xts,INDEX=weekly,FUN=mean)
plot(babywipe_xts_weekly)
monthly<-endpoints(babywipe_xts,on="months")
babywipe_xts_monthly<-period.apply(babywipe_xts,INDEX=monthly,FUN=mean)
plot(babywipe_xts_monthly)
Sales in November are influenced by discount days which are specified to get a better prediction outcome.
tsdisplay(babywipe_daily_additive$random, na.action = na.omit)
According to ACF plot, there is significant negative correlation at lag 3. Hence, moving average parameter should be 3. Also, it can be seen from PACF plot that correlation significantly decreased after lag 3 which means that AR parameter should be 3.
babywipe_model<-arima(babywipe_daily_additive$random, order = c(3,0,3))
AIC(babywipe_model)
## [1] 5217.269
babywipe_model2<-arima(babywipe_daily_additive$random, order = c(2,0,3))
AIC(babywipe_model2)
## [1] 5194.018
babywipe_model3<-arima(babywipe_daily_additive$random, order = c(1,0,3))
AIC(babywipe_model3)
## [1] 5227.107
Model 2 is better in terms of AIC values, therefore it is chosen.
Fitting the model
model_forecast_error <- predict(babywipe_model2, n.ahead = 11)$pred
last_trend_value <-tail(babywipe_daily_additive$trend[!is.na(babywipe_daily_additive$trend)],1)
seasonality=babywipe_daily_additive$seasonal[295:305]
babywipe_arima_forecast=model_forecast_error+seasonality+last_trend_value
babywipe_daily_te$forecast<-paste(babywipe_arima_forecast)
babywipe_daily_te$forecast<-as.numeric(babywipe_daily_te$forecast)
ggplot(babywipe_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Plot above shows for 10 days actual and predicted values with ARIMA model that we built on decomposed random part. Seasonal effects are seen clearly between actual and fitted values. Forecasted values are higher than actual due to the fact that last trend value is used for all predictions. If different model was built in order to predict trend values for coming days, better result may achieved.
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.615196 0.6609404 320.5171 0.615196 -0.6609404
WMAPE is 0.61. Model can be improved by adding regressor which will lower WMAPE.
babywipe_lm1<-lm(sold_count~visit_count+is.discount_days, data=babywipe_daily_tr)
summary(babywipe_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = babywipe_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1104.11 -144.55 -81.80 28.45 2936.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 270.29626 20.64823 13.09 < 2e-16 ***
## visit_count 0.03880 0.00457 8.49 4.4e-16 ***
## is.discount_days 984.69064 92.35038 10.66 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 350 on 389 degrees of freedom
## Multiple R-squared: 0.3374, Adjusted R-squared: 0.334
## F-statistic: 99.06 on 2 and 389 DF, p-value: < 2.2e-16
babywipe_lm2<-lm(sold_count~category_sold+price, data=babywipe_daily_tr)
summary(babywipe_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = babywipe_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -685.84 -76.99 8.36 80.37 1345.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 924.398434 194.061445 4.763 2.69e-06 ***
## category_sold 0.208919 0.006506 32.111 < 2e-16 ***
## price -12.929247 2.651374 -4.876 1.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 389 degrees of freedom
## Multiple R-squared: 0.8139, Adjusted R-squared: 0.813
## F-statistic: 850.8 on 2 and 389 DF, p-value: < 2.2e-16
babywipe_lm3<-lm(sold_count~price+favored_count+basket_count+category_sold, data=babywipe_daily_tr)
summary(babywipe_lm3)
##
## Call:
## lm(formula = sold_count ~ price + favored_count + basket_count +
## category_sold, data = babywipe_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -644.26 -41.18 10.37 48.97 1169.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.469328 124.209428 0.463 0.644
## price -1.220304 1.695079 -0.720 0.472
## favored_count -0.363498 0.023540 -15.442 <2e-16 ***
## basket_count 0.283905 0.011306 25.111 <2e-16 ***
## category_sold 0.113418 0.005918 19.163 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.1 on 387 degrees of freedom
## Multiple R-squared: 0.93, Adjusted R-squared: 0.9293
## F-statistic: 1286 on 4 and 387 DF, p-value: < 2.2e-16
babywipe_lm4<-lm(sold_count~favored_count+basket_count+category_sold+is.discount_days, data=babywipe_daily_tr)
summary(babywipe_lm4)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + category_sold +
## is.discount_days, data = babywipe_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -740.20 -39.02 11.37 47.56 1125.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.788167 8.588435 -3.003 0.002850 **
## favored_count -0.351183 0.023274 -15.089 < 2e-16 ***
## basket_count 0.283635 0.010722 26.454 < 2e-16 ***
## category_sold 0.108112 0.005978 18.084 < 2e-16 ***
## is.discount_days 121.622410 33.420190 3.639 0.000311 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 112.2 on 387 degrees of freedom
## Multiple R-squared: 0.9322, Adjusted R-squared: 0.9315
## F-statistic: 1331 on 4 and 387 DF, p-value: < 2.2e-16
babywipe_daily_te$forecastlm1<-predict(babywipe_lm1,babywipe_daily_te)
babywipe_daily_te$forecastlm2<-predict(babywipe_lm2,babywipe_daily_te)
babywipe_daily_te$forecastlm3<-predict(babywipe_lm3,babywipe_daily_te)
babywipe_daily_te$forecastlm4<-predict(babywipe_lm4,babywipe_daily_te)
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.007162319 0.1323333 62.80507 0.1205472 -0.0402405
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.4633883 0.4653101 241.4253 0.4633883 -0.4653101
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.2103372 0.2161906 109.5857 0.2103372 -0.2161906
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm4)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.1886224 0.1947759 98.27226 0.1886224 -0.1947759
4 different multiple linear regression model is built to understand which regressors are explaining model better. In terms of adjusted R square, 4th model is explaining sales behavior better than others. When WMAPE is checked, 1st model has minimum value among all models. Nevertheless, 1st and 4th model has close WMAPE values. Therefore, 4th model’s regressors will be chosen which are favored count, basket count, category sold and whether is it discount day or not.
babywipe_xreg1<-cbind(babywipe_daily_tr$favored_count,
babywipe_daily_tr$basket_count,
babywipe_daily_tr$category_sold,
babywipe_daily_tr$is.discount_days)
babywipe_xreg2<-cbind(babywipe_daily_te$favored_count,
babywipe_daily_te$basket_count,
babywipe_daily_te$category_sold,
babywipe_daily_te$is.discount_days)
babywipe_arimax<-Arima(babywipe_ts_daily,xreg=as.matrix(babywipe_xreg1),order=c(2,0,3))
babywipe_daily_te$forecastarimax<-forecast(babywipe_arimax,xreg=as.matrix(babywipe_xreg2))$mean
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 521 143.7651 -0.1168833 0.158371 85.02354 0.163193 -0.1031291
WMAPE is 0.14 which is desirable for forecasting. Plot below shows forecasted values with ARIMAX vs actual sales.
ggplot(babywipe_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
As can be seen from plot, actual and forecasted values are close to each other and WMAPE is low as desired. Wet wipes sales count explained clearly with favored count, basket count, category sold and discount days.
mont<-subset(alldata, alldata$product_content_id==48740784)
#mont_xts<-xts(mont,order.by=mont$event_date)
mont_daily_tr<-mont[mont$event_date<"2021-06-21"]
mont_daily_te<-mont[mont$event_date>="2021-06-21"]
plot(mont$sold_count, type = 'l')
Plot above shows sold counts of Altınyıldız Classics coat. Coat were available only at winter and at certain days.
mont_ts_daily<-ts(mont_daily_tr$sold_count, freq=7)
tsdisplay(mont_ts_daily)
Necessary data manipulations are made in order to construct a time series object with frequency 7. We see autocorrelation at lag 7 resulting that there is seasonality in weekly level. It will be examined further after decomposition.
mont_daily_additive<-decompose(mont_ts_daily, type ="additive")
mont_daily_multiplicative<-decompose(mont_ts_daily, type ="multiplicative")
plot(mont_daily_additive)
plot(mont_daily_multiplicative)
Since the coat is not sold at certain months constant variance assumption is violated in both cases.
In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test1=ur.kpss(mont_daily_additive$random, use.lag="7")
summary(test1)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0104
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test2=ur.kpss(mont_daily_multiplicative$random, use.lag="7")
summary(test2)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1086
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Test statistic of additive decomposition resulted better therefore it is chosen for further examinations.
plot(mont_daily_additive$seasonal[1:30], type='l')
plot(mont_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition. We can clearly see a seasonality in daily level.
Since there are not enough data points, decomposition cannot made at different levels. Although we cannot decompose, weekly and monthly plots can be seen below.
Weekly
mont_xts<-xts(mont_daily_tr$sold_count, order.by=as.Date(mont_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(mont_xts,on="weeks")
mont_xts_weekly<-period.apply(mont_xts,INDEX=weekly,FUN=mean)
plot(mont_xts_weekly)
Monthly
mont_xts<-xts(mont_daily_tr$sold_count, order.by=as.Date(mont_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(mont_xts,on="months")
mont_xts_weekly<-period.apply(mont_xts,INDEX=monthly,FUN=mean)
plot(mont_xts_weekly)
tsdisplay(mont_daily_additive$random, na.action = na.omit)
mont_model<-arima(mont_daily_additive$random, order = c(3,0,4))
summary(mont_model)
##
## Call:
## arima(x = mont_daily_additive$random, order = c(3, 0, 4))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 intercept
## 0.2799 0.3529 -0.4270 -0.8088 -0.8964 0.5796 0.1256 0e+00
## s.e. 0.1689 0.1657 0.0903 0.1700 0.2437 0.0797 0.0797 4e-04
##
## sigma^2 estimated as 3.751: log likelihood = -807.21, aic = 1632.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01248532 1.936705 0.7504695 86.26919 199.7648 0.6242751
## ACF1
## Training set 0.007485642
Random component of the additive model is investigated in terms of autocorrelation and partial autocorrelation to find appropriate p,d,q values. There is a significant autocorrelation at lag 3 in ACF and partial autocorrelation decreases after lag 4, therefore MA 3 and AR 4 can be used
mont_model2<-arima(mont_daily_additive$random, order = c(2,0,4))
AIC(mont_model2)
## [1] 1628.922
Model 2 is better in terms of AIC values, therefore it is chosen to perform the forecast.
Fitting the model
model_forecast_error <- predict(mont_model2, n.ahead = 11)$pred
last_trend_value <-tail(mont_daily_additive$trend[!is.na(mont_daily_additive$trend)],1)
seasonality=mont_daily_additive$seasonal[295:305]
mont_arima_forecast=model_forecast_error+seasonality+last_trend_value
mont_daily_te$forecast<-paste(mont_arima_forecast)
mont_daily_te$forecast<-as.numeric(mont_daily_te$forecast)
ggplot(mont_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
error_test(mont_daily_te$sold_count,mont_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 1.545455 1.128152 -0.730025 Inf 1.397975 0.9045719 -Inf
In this case, the value of WMAPE turned out to be really high, because on some days the product is not available, and ARIMA could not predict whether the product would be sold
We constructed three different linear model and compared the wmape values to decide on the regressors that will be added to ARIMA model
mont_lm1<-lm(sold_count~visit_count+is.discount_days, data=mont_daily_tr)
summary(mont_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = mont_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.329 -0.671 -0.663 -0.663 51.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.663459 0.189687 3.498 0.000524 ***
## visit_count 0.007301 0.002587 2.822 0.005016 **
## is.discount_days 1.592986 0.910109 1.750 0.080851 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.448 on 389 degrees of freedom
## Multiple R-squared: 0.02602, Adjusted R-squared: 0.02102
## F-statistic: 5.197 on 2 and 389 DF, p-value: 0.005924
mont_lm2<-lm(sold_count~category_sold+price, data=mont_daily_tr)
summary(mont_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = mont_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.355 -2.910 -0.698 0.334 47.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5755049 1.3268375 0.434 0.66545
## category_sold 0.0001318 0.0008419 0.157 0.87594
## price 0.0056698 0.0021337 2.657 0.00923 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.07 on 96 degrees of freedom
## (293 observations deleted due to missingness)
## Multiple R-squared: 0.06883, Adjusted R-squared: 0.04943
## F-statistic: 3.548 on 2 and 96 DF, p-value: 0.03261
mont_lm3<-lm(sold_count~favored_count+basket_count, data=mont_daily_tr)
summary(mont_lm3)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = mont_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6804 0.0378 0.1035 0.1035 9.8401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.103516 0.085889 -1.205 0.229
## favored_count 0.003277 0.014855 0.221 0.826
## basket_count 0.171107 0.004247 40.291 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.527 on 389 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.808
## F-statistic: 823.6 on 2 and 389 DF, p-value: < 2.2e-16
mont_daily_te$forecastlm1<-predict(mont_lm1,mont_daily_te)
mont_daily_te$forecastlm2<-predict(mont_lm2,mont_daily_te)
mont_daily_te$forecastlm3<-predict(mont_lm3,mont_daily_te)
error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 1.545455 1.128152 0.07639808 Inf 0.8102629 0.5242877 -Inf
error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 1.545455 1.128152 -0.8498868 Inf 1.55831 1.008318 -Inf
error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 1.545455 1.128152 0.5395895 Inf 0.9814649 0.6350655 -Inf
Model 1 performed better with a wmape value of 55%. Despite its adjusted R-square value Model 1 resulted better at wmape therefore discount days information and visit counts will be used as regressors.
mont_xreg1<-cbind(mont_daily_tr$is.discount_days,
mont_daily_tr$visit_count)
mont_xreg2<-cbind(mont_daily_te$is.discount_days,
mont_daily_te$visit_count)
mont_arimax<-Arima(mont_ts_daily,xreg=as.matrix(mont_xreg1),order=c(2,0,4))
mont_daily_te$forecastarimax<-forecast(mont_arimax,xreg=as.matrix(mont_xreg2))$mean
error_test(mont_daily_te$sold_count,mont_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 1.545455 1.128152 0.160217 Inf 0.8518484 0.551196 -Inf
With the addition of regressors WMAPE value decreased from 85% to 57%.
ggplot(mont_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Above plot shows the forecasted and actual values with the ARIMAX model. Model significantly improved with the addition of regressors.
oralb<-subset(alldata, alldata$product_content_id==32939029)
#oralb_xts<-xts(oralb,order.by=oralb$event_date)
oralb_daily_tr<-oralb[oralb$event_date<"2021-06-21"]
oralb_daily_te<-oralb[oralb$event_date>="2021-06-21"]
plot(oralb$sold_count, type='l')
Plot above shows sold counts of Oral-B Toothbrush. Variance seems increasing and also sales are increasing over time.
oralb_ts_daily<-ts(oralb_daily_tr$sold_count, freq=7)
tsdisplay(oralb_ts_daily)
Necessary data manipulations are made in order to obtain daily data. ACF plot shows high correlation and PACF shows only significant correlation at lag 1. It will be examined further after decomposition.
oralb_daily_additive<-decompose(oralb_ts_daily, type ="additive")
oralb_daily_multiplicative<-decompose(oralb_ts_daily, type ="multiplicative")
plot(oralb_daily_additive)
plot(oralb_daily_multiplicative)
For additive decomposition random part has increasing variance, which does not satisfy constants variance assumption.For multiplicative decomposition random part has decreasing variance, which does not satisfy constants variance assumption as well.
In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test1=ur.kpss(oralb_daily_additive$random, use.lag="7")
summary(test1)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0132
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test2=ur.kpss(oralb_daily_multiplicative$random, use.lag="7")
summary(test2)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.3125
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Test statistic of additive decomposition resulted better therefore it is chosen for further examinations.
plot(oralb_daily_additive$seasonal[1:30], type='l')
plot(oralb_daily_additive$trend, type='l')
Plots above show closer looks for trend and seasonal components of additive decomposition. We can clearly see a seasonality in daily level.
Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
Weekly
oralb_xts<-xts(oralb_daily_tr$sold_count, order.by=as.Date(oralb_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(oralb_xts,on="weeks")
oralb_xts_weekly<-period.apply(oralb_xts,INDEX=weekly,FUN=mean)
plot(oralb_xts_weekly)
Monthly
oralb_xts<-xts(oralb_daily_tr$sold_count, order.by=as.Date(oralb_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(oralb_xts,on="months")
oralb_xts_weekly<-period.apply(oralb_xts,INDEX=monthly,FUN=mean)
plot(oralb_xts_weekly)
Weekly and monthly plots shows demand for such product increased over time. Even so, there are still outlier weeks and months that cannot be explained with only sold item count. Price or promotions may affect such outlier points.
To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.
tsdisplay(oralb_daily_additive$random, na.action = na.omit)
oralb_model<-arima(oralb_daily_additive$random, order = c(3,0,3))
summary(oralb_model)
##
## Call:
## arima(x = oralb_daily_additive$random, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## 0.2399 0.0993 -0.2964 -0.2553 -0.6100 -0.1347 -0.0024
## s.e. 0.2453 0.2464 0.1538 0.2488 0.2466 0.1419 0.0251
##
## sigma^2 estimated as 787.9: log likelihood = -1837.79, aic = 3691.58
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.4667117 28.07005 19.36278 49.37787 226.8666 0.6808855
## ACF1
## Training set -0.005579994
As can be seen from the graphs there is significant negative autocorrelation in lag 3 at ACF and partial autocorrelation decreased significantly after lag 3 in PACF therefore parameters would be selected as (3,0,3)
oralb_model2<-arima(oralb_daily_additive$random, order = c(2,0,3))
AIC(oralb_model2)
## [1] 3690.505
Model 2 is better in terms of AIC values, therefore it is chosen to construct the ARIMA model
Fitting the model
model_forecast_error <- predict(oralb_model2, n.ahead = 11)$pred
last_trend_value <-tail(oralb_daily_additive$trend[!is.na(oralb_daily_additive$trend)],1)
seasonality=oralb_daily_additive$seasonal[295:305]
oralb_arima_forecast=model_forecast_error+seasonality+last_trend_value
oralb_daily_te$forecast<-paste(oralb_arima_forecast)
oralb_daily_te$forecast<-as.numeric(oralb_daily_te$forecast)
ggplot(oralb_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
From above plot, it can be seen that we are overpredicting, it is mainly due to the constant trend assumption, with the addition of regressors this bias may be decreased.
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 80.27273 41.46587 -1.153151 1.661979 92.56657 1.153151 -1.661979
WMAPE value is too much in this case because of the reasons mentioned above.
To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.
oralb_lm1<-lm(sold_count~visit_count+is.discount_days, data=oralb_daily_tr)
summary(oralb_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = oralb_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124.33 -29.55 -16.55 15.01 235.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.055e+01 3.156e+00 12.849 < 2e-16 ***
## visit_count 2.343e-02 7.527e-04 31.134 < 2e-16 ***
## is.discount_days 4.858e+01 1.371e+01 3.543 0.000443 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.01 on 389 degrees of freedom
## Multiple R-squared: 0.7192, Adjusted R-squared: 0.7178
## F-statistic: 498.2 on 2 and 389 DF, p-value: < 2.2e-16
oralb_lm2<-lm(sold_count~category_sold+price, data=oralb_daily_tr)
summary(oralb_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = oralb_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -242.52 -56.85 -26.14 36.76 380.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.950e+02 6.511e+01 -4.531 7.86e-06 ***
## category_sold 5.331e-02 5.824e-03 9.154 < 2e-16 ***
## price 2.501e+00 4.721e-01 5.297 2.00e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.45 on 380 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.2068, Adjusted R-squared: 0.2026
## F-statistic: 49.54 on 2 and 380 DF, p-value: < 2.2e-16
oralb_lm3<-lm(sold_count~favored_count+basket_count, data=oralb_daily_tr)
summary(oralb_lm3)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = oralb_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.608 -10.381 -3.624 7.994 115.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.070275 2.068327 1.484 0.139
## favored_count -0.036779 0.006199 -5.933 6.59e-09 ***
## basket_count 0.261759 0.007315 35.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.96 on 389 degrees of freedom
## Multiple R-squared: 0.9129, Adjusted R-squared: 0.9125
## F-statistic: 2039 on 2 and 389 DF, p-value: < 2.2e-16
oralb_daily_te$forecastlm1<-predict(oralb_lm1,oralb_daily_te)
oralb_daily_te$forecastlm2<-predict(oralb_lm2,oralb_daily_te)
oralb_daily_te$forecastlm3<-predict(oralb_lm3,oralb_daily_te)
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 80.27273 41.46587 -0.2773368 0.5155861 26.13322 0.3255554 -0.4866785
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 80.27273 41.46587 -0.2715985 0.6714054 36.86528 0.4592504 -0.5582226
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 80.27273 41.46587 0.1447854 0.1452873 13.09143 0.1630869 0.117745
Model 3 resulted a WMAPE of 14% it is much lower than the wmape of the ARIMA model. Favored count and basket count will be added to the model.
oralb_xreg1<-cbind(oralb_daily_tr$favored_count,
oralb_daily_tr$basket_count)
oralb_xreg2<-cbind(oralb_daily_te$favored_count,
oralb_daily_te$basket_count)
oralb_arimax<-Arima(oralb_ts_daily,xreg=as.matrix(oralb_xreg1),order=c(3,0,3))
oralb_daily_te$forecastarimax<-forecast(oralb_arimax,xreg=as.matrix(oralb_xreg2))$mean
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 80.27273 41.46587 0.06866556 0.1598583 12.86963 0.1603238 -0.0009525859
With the addition of regressors wmape value increased to 14.4%, in this case ARIMA doesn’t perform well.
ggplot(oralb_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Above plot shows the fitted values of the last model. Model fits well with the actual values.
bikini1<-subset(alldata, alldata$product_content_id==73318567)
bikini1_daily_tr<-bikini1[bikini1$event_date<"2021-06-21"]
bikini1_daily_te<-bikini1[bikini1$event_date>="2021-06-21"]
plot(bikini1$sold_count, type = 'l')
Plot above shows the sold count for bikini1. Sale volumes increased at summer and also there were sales at March
bikini1_ts_daily<-ts(bikini1_daily_tr$sold_count, freq=7)
tsdisplay(bikini1_ts_daily)
Autocorrelation values are decreasing and PACF shows significant correlation at lag1.
bikini1_daily_additive<-decompose(bikini1_ts_daily, type ="additive")
bikini1_daily_multiplicative<-decompose(bikini1_ts_daily, type ="multiplicative")
plot(bikini1_daily_additive)
plot(bikini1_daily_multiplicative)
Constant variance assumption is violated in both case In order to understand which model is better, the method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test1=ur.kpss(bikini1_daily_additive$random, use.lag="7")
summary(test1)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0259
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test2=ur.kpss(bikini1_daily_multiplicative$random, use.lag="7")
summary(test2)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1168
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Test statistic of additivie decomposition resulted better, therefore it is chosen for further examination.
plot(bikini1_daily_additive$seasonal[1:30], type='l')
plot(bikini1_daily_additive$trend, type='l')
Checking the seasonal component, we can clearly see seasonality in daily level. Plots above show closer looks for trend and seasonal components of additive decomposition. Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
Weekly
bikini1_xts<-xts(bikini1_daily_tr$sold_count, order.by=as.Date(bikini1_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(bikini1_xts,on="weeks")
bikini1_xts_weekly<-period.apply(bikini1_xts,INDEX=weekly,FUN=mean)
plot(bikini1_xts_weekly)
Monthly
bikini1_xts<-xts(bikini1_daily_tr$sold_count, order.by=as.Date(bikini1_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(bikini1_xts,on="months")
bikini1_xts_weekly<-period.apply(bikini1_xts,INDEX=monthly,FUN=mean)
plot(bikini1_xts_weekly)
To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.
tsdisplay(bikini1_daily_additive$random, na.action = na.omit)
bikini1_model<-arima(bikini1_daily_additive$random, order = c(5,0,5))
summary(bikini1_model)
##
## Call:
## arima(x = bikini1_daily_additive$random, order = c(5, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1762 -0.3644 -0.4165 0.5881 -0.3656 -0.2840 0.0761 0.1389
## s.e. 0.1465 0.0546 0.0638 0.0727 0.0997 0.1591 0.0623 0.0520
## ma4 ma5 intercept
## -0.9411 0.0101 0.0003
## s.e. 0.0551 0.1546 0.0099
##
## sigma^2 estimated as 77.51: log likelihood = -1392.75, aic = 2809.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04492755 8.803879 3.927318 -37.79243 242.1186 0.6478351
## ACF1
## Training set -0.0001361801
As can be seen from the graphs there is significant negative autocorrelation in lag 5 at ACF and partial autocorrelation decreased significantly after lag 5 in PACF therefore parameters would be selected as (5,0,5)
bikini1_model2<-arima(bikini1_daily_additive$random, order = c(5,0,4))
AIC(bikini1_model2)
## [1] 2807.508
Model 2 is better in terms of AIC values, therefore it is chosen to construct the ARIMA model
Fitting the model
model_forecast_error <- predict(bikini1_model2, n.ahead = 11)$pred
last_trend_value <-tail(bikini1_daily_additive$trend[!is.na(bikini1_daily_additive$trend)],1)
seasonality=bikini1_daily_additive$seasonal[295:305]
bikini1_arima_forecast=model_forecast_error+seasonality+last_trend_value
bikini1_daily_te$forecast<-paste(bikini1_arima_forecast)
bikini1_daily_te$forecast<-as.numeric(bikini1_daily_te$forecast)
ggplot(bikini1_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
From above plot, it can be seen that we are overpredicting, it is mainly due to the constant trend assumption, with the addition of regressors this bias may be decreased.
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 26.36364 10.46205 -0.8187966 1.07578 21.83745 0.8283172 -1.070439
WMAPE value is too much in this case because of the reasons mentioned above.
To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.
bikini1_lm1<-lm(sold_count~price+is.discount_days, data=bikini1_daily_tr)
summary(bikini1_lm1)
##
## Call:
## lm(formula = sold_count ~ price + is.discount_days, data = bikini1_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.84 -62.22 -17.22 44.28 219.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -196.957 842.376 -0.234 0.8156
## price 4.387 14.008 0.313 0.7547
## is.discount_days -58.824 32.447 -1.813 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.76 on 108 degrees of freedom
## (281 observations deleted due to missingness)
## Multiple R-squared: 0.03112, Adjusted R-squared: 0.01318
## F-statistic: 1.735 on 2 and 108 DF, p-value: 0.1814
bikini1_lm2<-lm(sold_count~category_sold+price, data=bikini1_daily_tr)
summary(bikini1_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = bikini1_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.98 -28.98 -12.96 32.01 149.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.066e+03 7.208e+02 2.867 0.00499 **
## category_sold 2.333e-02 2.778e-03 8.399 1.95e-13 ***
## price -3.425e+01 1.203e+01 -2.847 0.00529 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.87 on 108 degrees of freedom
## (281 observations deleted due to missingness)
## Multiple R-squared: 0.3961, Adjusted R-squared: 0.3849
## F-statistic: 35.42 on 2 and 108 DF, p-value: 1.487e-12
bikini1_lm3<-lm(sold_count~favored_count+basket_count, data=bikini1_daily_tr)
summary(bikini1_lm3)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = bikini1_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.828 -0.148 -0.148 -0.148 34.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.148174 0.435072 0.341 0.734
## favored_count -0.015716 0.001207 -13.017 <2e-16 ***
## basket_count 0.247511 0.003448 71.790 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.003 on 389 degrees of freedom
## Multiple R-squared: 0.9719, Adjusted R-squared: 0.9717
## F-statistic: 6721 on 2 and 389 DF, p-value: < 2.2e-16
bikini1_daily_te$forecastlm1<-predict(bikini1_lm1,bikini1_daily_te)
bikini1_daily_te$forecastlm2<-predict(bikini1_lm2,bikini1_daily_te)
bikini1_daily_te$forecastlm3<-predict(bikini1_lm3,bikini1_daily_te)
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 26.36364 10.46205 -1.511946 1.846804 39.86038 1.511946 -1.846804
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 26.36364 10.46205 -4.673931 5.616296 123.2218 4.673931 -5.616296
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 26.36364 10.46205 0.04327184 0.2039842 5.068131 0.1922395 0.02200445
Model 3 resulted a WMAPE of 18% it is much lower than the wmape of the ARIMA model. Favored count and basket count will be added to the model.
bikini1_xreg1<-cbind(bikini1_daily_tr$favored_count,
bikini1_daily_tr$basket_count)
bikini1_xreg2<-cbind(bikini1_daily_te$favored_count,
bikini1_daily_te$basket_count)
bikini1_arimax<-Arima(bikini1_ts_daily,xreg=as.matrix(bikini1_xreg1),order=c(3,0,3))
bikini1_daily_te$forecastarimax<-forecast(bikini1_arimax,xreg=as.matrix(bikini1_xreg2))$mean
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 26.36364 10.46205 0.02295718 0.2159454 5.033602 0.1909297 0.005587715
With the addition of regressors wmape value decreased to 18.3%.
ggplot(bikini1_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Above plot shows the fitted values of the last model. Model fits well with the actual values.
bikini2<-subset(alldata, alldata$product_content_id==32737302)
#bikini2_xts<-xts(bikini2,order.by=bikini2$event_date)
bikini2_daily_tr<-bikini2[bikini2$event_date<"2021-06-21"]
bikini2_daily_te<-bikini2[bikini2$event_date>="2021-06-21"]
plot(bikini2$sold_count, type='l')
Plot above shows the sold counts for bikini2. Sales volumes shows a trend after January.
bikini2_ts_daily<-ts(bikini2_daily_tr$sold_count, freq=7)
tsdisplay(bikini2_ts_daily)
Autocorrelation plot indicates a trend and there is a significant correlation only in lag1 at PACF.
bikini2_daily_additive<-decompose(bikini2_ts_daily, type ="additive")
bikini2_daily_multiplicative<-decompose(bikini2_ts_daily, type ="multiplicative")
plot(bikini2_daily_additive)
plot(bikini2_daily_multiplicative)
For additive decomposition variance seems more constant comparing with the multiplicative decomposition. At multiplicative decomposition there is pikes at some days. The method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test1=ur.kpss(bikini2_daily_additive$random, use.lag="7")
summary(test1)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0667
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test2=ur.kpss(bikini2_daily_multiplicative$random, use.lag="7")
summary(test2)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.122
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Additive decomposition resulted better in terms of the test statistic, therefore it is chosen for further examination
plot(bikini2_daily_additive$seasonal[1:30], type='l')
plot(bikini2_daily_additive$trend, type='l')
Checking the seasonal component, we can clearly see seasonality in daily level. Plots above show closer looks for trend and seasonal components of additive decomposition. Since there are not enough data points, decomposition cannot made different levels. Altough we cannot decompose, weekly and monthly plots can be seen below.
Weekly
bikini2_xts<-xts(bikini2_daily_tr$sold_count, order.by=as.Date(bikini2_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(bikini2_xts,on="weeks")
bikini2_xts_weekly<-period.apply(bikini2_xts,INDEX=weekly,FUN=mean)
plot(bikini2_xts_weekly)
Monthly
bikini2_xts<-xts(bikini2_daily_tr$sold_count, order.by=as.Date(bikini2_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(bikini2_xts,on="months")
bikini2_xts_weekly<-period.apply(bikini2_xts,INDEX=monthly,FUN=mean)
plot(bikini2_xts_weekly)
We can see the trend occurrence after January from both weekly and monthly transformed data.
To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.
tsdisplay(bikini2_daily_additive$random, na.action = na.omit)
bikini2_model<-arima(bikini2_daily_additive$random, order = c(3,0,3))
summary(bikini2_model)
##
## Call:
## arima(x = bikini2_daily_additive$random, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## -0.0022 -0.1073 -0.0801 -0.0176 -0.4037 -0.5787 -0.0061
## s.e. 0.1235 0.0785 0.0811 0.1174 0.0618 0.1070 0.0046
##
## sigma^2 estimated as 22.61: log likelihood = -1152.51, aic = 2321.03
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.05595952 4.755316 2.640292 74.72519 144.1095 0.6245488
## ACF1
## Training set -0.01022994
As can be seen from the graphs there is significant negative autocorrelation in lag 3 at ACF and partial autocorrelation decreased significantly after lag 3 in PACF therefore parameters would be selected as (3,0,3)
bikini2_model2<-arima(bikini2_daily_additive$random, order = c(2,0,3))
AIC(bikini2_model2)
## [1] 2320.045
Model 2 is better in terms of AIC values, therefore it is chosen to construct the ARIMA model
Fitting the model
model_forecast_error <- predict(bikini2_model2, n.ahead = 11)$pred
last_trend_value <-tail(bikini2_daily_additive$trend[!is.na(bikini2_daily_additive$trend)],1)
seasonality=bikini2_daily_additive$seasonal[295:305]
bikini2_arima_forecast=model_forecast_error+seasonality+last_trend_value
bikini2_daily_te$forecast<-paste(bikini2_arima_forecast)
bikini2_daily_te$forecast<-as.numeric(bikini2_daily_te$forecast)
ggplot(bikini2_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
From above plot, it can be seen that we are overpredicting, it is mainly due to the constant trend assumption, with the addition of regressors this bias may be decreased.
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 55 16.2911 -0.506666 0.6193516 28.61947 0.520354 -0.6106982
WMAPE value is too much in this case because of the reasons mentioned above.
To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.
bikini2_lm1<-lm(sold_count~price+is.discount_days, data=bikini2_daily_tr)
summary(bikini2_lm1)
##
## Call:
## lm(formula = sold_count ~ price + is.discount_days, data = bikini2_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.81 -14.97 -0.50 10.09 72.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.198 62.497 -0.819 0.414
## price 1.370 1.039 1.318 0.189
## is.discount_days 12.834 9.283 1.383 0.169
##
## Residual standard error: 20.53 on 167 degrees of freedom
## (222 observations deleted due to missingness)
## Multiple R-squared: 0.01545, Adjusted R-squared: 0.003658
## F-statistic: 1.31 on 2 and 167 DF, p-value: 0.2725
bikini2_lm2<-lm(sold_count~category_sold+price, data=bikini2_daily_tr)
summary(bikini2_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = bikini2_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.575 -8.500 -0.676 8.639 44.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.266e+02 4.183e+01 3.027 0.00286 **
## category_sold 9.154e-03 6.948e-04 13.173 < 2e-16 ***
## price -1.938e+00 7.044e-01 -2.751 0.00660 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 167 degrees of freedom
## (222 observations deleted due to missingness)
## Multiple R-squared: 0.5116, Adjusted R-squared: 0.5058
## F-statistic: 87.48 on 2 and 167 DF, p-value: < 2.2e-16
bikini2_lm3<-lm(sold_count~favored_count+basket_count, data=bikini2_daily_tr)
summary(bikini2_lm3)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = bikini2_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.712 -0.063 0.159 0.159 35.205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.159138 0.356595 -0.446 0.656
## favored_count -0.025855 0.002303 -11.225 <2e-16 ***
## basket_count 0.222054 0.005072 43.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.785 on 389 degrees of freedom
## Multiple R-squared: 0.9219, Adjusted R-squared: 0.9215
## F-statistic: 2294 on 2 and 389 DF, p-value: < 2.2e-16
bikini2_daily_te$forecastlm1<-predict(bikini2_lm1,bikini2_daily_te)
bikini2_daily_te$forecastlm2<-predict(bikini2_lm2,bikini2_daily_te)
bikini2_daily_te$forecastlm3<-predict(bikini2_lm3,bikini2_daily_te)
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 55 16.2911 0.4185357 0.3798232 23.01947 0.4185357 0.3798232
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 55 16.2911 -0.1482096 0.3908551 20.80109 0.3782016 -0.242806
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 55 16.2911 0.1197114 0.127294 7.265821 0.1321058 0.1088184
Model 3 resulted a WMAPE of 13.5% it is much lower than the wmape of the ARIMA model. Favored count and basket count will be added to the model.
bikini2_xreg1<-cbind(bikini2_daily_tr$favored_count,
bikini2_daily_tr$basket_count)
bikini2_xreg2<-cbind(bikini2_daily_te$favored_count,
bikini2_daily_te$basket_count)
bikini2_arimax<-Arima(bikini2_ts_daily,xreg=as.matrix(bikini2_xreg1),order=c(2,0,2))
bikini2_daily_te$forecastarimax<-forecast(bikini2_arimax,xreg=as.matrix(bikini2_xreg2))$mean
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 55 16.2911 0.005079662 0.1186635 6.154088 0.1118925 -0.02147385
With the addition of regressors wmape value decreased to 10%
ggplot(bikini2_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Above plot shows the fitted values of the last model. Model fits well with the actual values.
tayt<-subset(alldata, alldata$product_content_id==31515569)
#tayt_xts<-xts(tayt,order.by=tayt$event_date)
tayt_daily_tr<-tayt[tayt$event_date<"2021-06-21"]
tayt_daily_te<-tayt[tayt$event_date>="2021-06-21"]
plot(tayt$sold_count, type='l')
Above plot shows the sales volumes for Trendyolmilla Tayt. Tayt has been sold for the whole year and there is some pikes at certain days.
tayt_ts_daily<-ts(tayt_daily_tr$sold_count, freq=7)
tsdisplay(tayt_ts_daily)
Necessary data manipulations are made in order to obtain daily data. ACF plot shows decreasing correlation at lag 1,2 and 3 and also there is a correlation after lag 14. PACF shows only significant correlation at lag 1. It will be examined further after decomposition.
tayt_daily_additive<-decompose(tayt_ts_daily, type ="additive")
tayt_daily_multiplicative<-decompose(tayt_ts_daily, type ="multiplicative")
plot(tayt_daily_additive)
plot(tayt_daily_multiplicative)
Above plots show trend, seasonal and random components of multiplicative and additive decomposition. There is pikes at certain days in additive decomposition and constant variance assumption is violated. For multiplicative decomposition variance is higher than the additive. The method to be used in the continuation of the study can be decided by applying specific tests that can measure stationarity for random part.
test1=ur.kpss(tayt_daily_additive$random, use.lag="7")
summary(test1)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0126
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
test2=ur.kpss(tayt_daily_multiplicative$random, use.lag="7")
summary(test2)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1298
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Additive decomposition resulted better and is chosen to be used for futher examination.
Since there are not enough data points, decomposition cannot made in different levels. Although we cannot decompose, weekly and monthly plots can be seen below.
plot(tayt_daily_additive$seasonal[1:30], type='l')
plot(tayt_daily_additive$trend, type='l')
Weekly
tayt_xts<-xts(tayt_daily_tr$sold_count, order.by=as.Date(tayt_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(tayt_xts,on="weeks")
tayt_xts_weekly<-period.apply(tayt_xts,INDEX=weekly,FUN=mean)
plot(tayt_xts_weekly)
Monthly
tayt_xts<-xts(tayt_daily_tr$sold_count, order.by=as.Date(tayt_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(tayt_xts,on="months")
tayt_xts_monthly<-period.apply(tayt_xts,INDEX=monthly,FUN=mean)
plot(tayt_xts_monthly)
It can be seen that sales volumes of tayt increased at November influenced by the discount days, which are determined in our project to yield better forecast result.
tsdisplay(tayt_daily_additive$random, na.action = na.omit)
According to ACF plot, there is significant negative correlation at lag 4. Hence, moving average parameter should be 4. Also, it can be seen from PACF plot that correlation significantly decreased after lag 1 which means that AR parameter should be 1.
tayt_model<-arima(tayt_daily_additive$random, order = c(1,0,4))
AIC(tayt_model)
## [1] 5838.382
tayt_model2<-arima(tayt_daily_additive$random, order = c(1,0,3))
AIC(tayt_model2)
## [1] 5852.143
tayt_model3<-arima(tayt_daily_additive$random, order = c(1,0,2))
AIC(tayt_model3)
## [1] 5953.414
When adjacent parameters also checked, still model 1 which has (1,0,4) parameters is better in terms of AIC values, therefore it is chosen.
Fitting the model
model_forecast_error <- predict(tayt_model, n.ahead = 11)$pred
last_trend_value <-tail(tayt_daily_additive$trend[!is.na(tayt_daily_additive$trend)],1)
seasonality=tayt_daily_additive$seasonal[295:305]
tayt_arima_forecast=model_forecast_error+seasonality+last_trend_value
tayt_daily_te$forecast<-paste(tayt_arima_forecast)
tayt_daily_te$forecast<-as.numeric(tayt_daily_te$forecast)
ggplot(tayt_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
Plot above shows for 10 days actual and predicted values with ARIMA model that we built on decomposed random part. Model that we built with ARIMA (1,0,4) parameters shows peaks and downs on sales. It does not clearly explain actual sales data.
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecast)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 -0.09614783 0.5779659 153.5975 0.5235738 -0.1245216
According to WMAPE —————————————-
tayt_lm1<-lm(sold_count~visit_count+is.discount_days, data=tayt_daily_tr)
summary(tayt_lm1)
##
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = tayt_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3179.6 -386.1 -126.8 177.4 5935.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.513e+02 4.318e+01 17.400 < 2e-16 ***
## visit_count -7.606e-03 2.083e-03 -3.652 0.000295 ***
## is.discount_days 4.193e+03 2.021e+02 20.746 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 743.4 on 389 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.5242
## F-statistic: 216.4 on 2 and 389 DF, p-value: < 2.2e-16
tayt_lm2<-lm(sold_count~category_sold+price, data=tayt_daily_tr)
summary(tayt_lm2)
##
## Call:
## lm(formula = sold_count ~ category_sold + price, data = tayt_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2726.2 -267.0 -17.8 282.4 5051.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3493.53040 277.79024 12.58 <2e-16 ***
## category_sold 0.33963 0.01494 22.74 <2e-16 ***
## price -68.28039 5.62092 -12.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 680.3 on 389 degrees of freedom
## Multiple R-squared: 0.6036, Adjusted R-squared: 0.6016
## F-statistic: 296.2 on 2 and 389 DF, p-value: < 2.2e-16
tayt_lm3<-lm(sold_count~price+favored_count+category_sold, data=tayt_daily_tr)
summary(tayt_lm3)
##
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold,
## data = tayt_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2743.7 -259.6 -48.5 243.8 5252.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3656.61096 282.50779 12.943 < 2e-16 ***
## price -72.54566 5.80720 -12.492 < 2e-16 ***
## favored_count 0.05999 0.02271 2.642 0.00858 **
## category_sold 0.32884 0.01538 21.386 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 675.1 on 388 degrees of freedom
## Multiple R-squared: 0.6106, Adjusted R-squared: 0.6076
## F-statistic: 202.8 on 3 and 388 DF, p-value: < 2.2e-16
tayt_lm4<-lm(sold_count~favored_count+basket_count+visit_count+is.discount_days, data=tayt_daily_tr)
summary(tayt_lm4)
##
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count +
## is.discount_days, data = tayt_daily_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3437.6 -123.9 -24.3 136.8 5043.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.774e+01 3.945e+01 1.971 0.0495 *
## favored_count 2.698e-02 2.104e-02 1.283 0.2004
## basket_count 2.024e-01 9.078e-03 22.297 < 2e-16 ***
## visit_count -1.542e-02 1.796e-03 -8.586 2.22e-16 ***
## is.discount_days 1.829e+03 1.669e+02 10.959 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 474.1 on 387 degrees of freedom
## Multiple R-squared: 0.8085, Adjusted R-squared: 0.8065
## F-statistic: 408.5 on 4 and 387 DF, p-value: < 2.2e-16
tayt_daily_te$forecastlm1<-predict(tayt_lm1,tayt_daily_te)
tayt_daily_te$forecastlm2<-predict(tayt_lm2,tayt_daily_te)
tayt_daily_te$forecastlm3<-predict(tayt_lm3,tayt_daily_te)
tayt_daily_te$forecastlm4<-predict(tayt_lm4,tayt_daily_te)
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm1)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 -1.295053 1.447989 379.9214 1.295053 -1.447989
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm2)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 -4.682086 4.716494 1373.554 4.682086 -4.716494
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm3)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 -4.287784 4.292085 1257.88 4.287784 -4.292085
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm4)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 0.09496337 0.1889774 59.66235 0.2033734 0.05237606
4 different multiple linear regression model is built to understand which regressors are explaining model better. In terms of adjusted R square and WMAPE 4th model is explaining sales behavior better than others. Therefore, 4th model’s regressors will be chosen which are favored count, basket count, visit count and discount days.
tayt_xreg1<-cbind(tayt_daily_tr$favored_count,
tayt_daily_tr$basket_count,
tayt_daily_tr$visit_count,
tayt_daily_tr$is.discount_days)
tayt_xreg2<-cbind(tayt_daily_te$favored_count,
tayt_daily_te$basket_count,
tayt_daily_te$visit_count,
tayt_daily_te$is.discount_days)
tayt_arimax<-Arima(tayt_ts_daily,xreg=as.matrix(tayt_xreg1),order=c(1,0,4))
tayt_daily_te$forecastarimax<-forecast(tayt_arimax,xreg=as.matrix(tayt_xreg2))$mean
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastarimax)
## n mean sd bias mape mad wmape MPE
## 1 11 293.3636 82.88941 0.03536987 0.1820554 54.7053 0.1864761 -0.01095385
WMAPE is 0.17 which is little less than multiple linear regression model with same regressors. Plot below shows forecasted values with ARIMAX vs actual sales.
ggplot(tayt_daily_te ,aes(x=event_date)) +
geom_line(aes(y=sold_count,color='actual')) +
geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales",
x = "Time",
y = "Sales"
)+
theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0))
There is ultra difference in forecasted values between ARIMA and ARIMAX models. As can be seen from plot above, regressors improved model.
In this study forecasting models are constructed for various products that are being sold in Trendyol. Products were decomposed to their seasonal and trend components, after deciding on the appropriate decomposition technique, random components of the products are evaluated in terms of their autocorrelation and partial autocorrelation plots to choose the best ARIMA model. Potential regressors are investigated and added to the model according to their WMAPE values. In overall, models resulted well and they can be used in forecasting the upcoming sales volumes. Some of assumptions made such as taking the last trend value and considering it as constant reduced the overall performance of the models, and also since we used the regressor values from test set instead of predicting them, error rates of the real-time forecasts would be higher than the values found in this study. If future product prices were given, more precise forecasting might be achieved. Furthermore, because the data had a large number of missing values, using such variables as regressors was useless. We haven’t used other sources like Google Trends to back up our forecasts since we all agreed that sometimes the best data to use is just the basic facts. Using other resources may result in better results.